Phoebus: A System for Extracting and Integrating Data from Unstructured and Ungrammatical Sources
Authors
Abstract
With the proliferation of online classifieds and auctions comes a new need to meaningfully search and organize the items for sale. However, since sellers' item descriptions are not structured and do not conform to a standard set of values (think “Chevy” versus “Chevrolet”), searching and organizing this data is difficult. This paper describes a working demonstration of the Phoebus system, which uses both record linkage and information extraction to parse out the meaningful attributes of an item description and assign them standard values. The data can then be sorted, searched, and linked to other data sources that require standard attribute values for linking.
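To make the idea of linking a free-text post to standard values concrete, here is a minimal sketch in Python. It is not the Phoebus implementation: the reference set, the function names, and the character-level similarity measure (`difflib.SequenceMatcher`) are all assumptions chosen for illustration. The sketch scores a post against each reference record and returns the standard values of the best match.

```python
from difflib import SequenceMatcher

# Hypothetical reference set of standardized records (illustrative only,
# not taken from the paper).
REFERENCE_SET = [
    {"make": "Chevrolet", "model": "Impala"},
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
]


def token_score(token, value):
    """Similarity in [0, 1] between one post token and one reference value."""
    return SequenceMatcher(None, token.lower(), value.lower()).ratio()


def record_score(tokens, record):
    """Sum, over the record's attributes, of the best-matching token's score."""
    return sum(max(token_score(t, v) for t in tokens) for v in record.values())


def standardize(post):
    """Return the reference record that best matches the post, or None if the post is empty.

    The returned record supplies the standard attribute values for the post.
    """
    tokens = post.split()
    if not tokens:
        return None
    return max(REFERENCE_SET, key=lambda r: record_score(tokens, r))
```

Under these assumptions, calling `standardize("chevy impala 4dr, runs great")` selects the Chevrolet Impala record, so the informal "chevy" is replaced by the standard value "Chevrolet". A real system would need a much larger reference set and a learned matching function rather than a fixed string similarity.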
Similar resources
A Reference-set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources
This thesis investigates information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum postings. Since the data is unstructured and ungrammatical, this information extraction precludes the use of rule-based methods that rely on consistent structures within the text or natural language processing techniques that rely on grammar. Inste...
An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
There exist numerous sources of data on the World Wide Web that contain useful information but are not structured or grammatical enough to support traditional information extraction. Furthermore, even if the information extraction could be done, the extracted values would need to be standardized to ensure the queries over the source are accurate. This paper presents an automatic, scalable appro...
Application of Big Data Analytics in Power Distribution Network
Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...
Impact of Feed Sources and Feeding System on Milk Production and Marketing in the Babille District of East Hararghe Zone, Ethiopia
The aim of this article was to investigate the impact of feed sources and feeding system on milk production and milk marketing in the Babille district of Eastern Hararghe zone. Data were collected using a structured questionnaire which was administered to 152 randomly selected sample dairy cow keepers in the district. Data was analyzed using descriptive methods and regression analysis. Data fro...
Integrating the Population Perspective into Health System Performance Assessment (IPHA): Study Protocol for a Cross-Sectional Study in Germany Linking Survey and Claims Data of Statutorily and Privately Insured
Background Health system performance assessment (HSPA) is a major tool for evidence-based governance in health systems and patient/population-orientation is increasingly considered as an important aspect. The IPHA study aims (1) to undertake a comprehensive performance assessment of the German health system from a population perspec...
Publication date: 2006